Abstract: A novel coronavirus has been identified as the cause of the outbreak of severe acute respiratory syndrome
(SARS). Previous phylogenetic analyses based on sequence alignments show that SARS-CoVs form a new group distantly
related to the other three groups of previously characterized coronaviruses. In this article, a new approach based on the 2D
graphical representation of the whole genome sequence is proposed to analyze the phylogenetic relationships of
coronaviruses. The evolutionary distances are obtained by measuring the differences among the two-dimensional curves.

Introduction

The outbreak of atypical pneumonia, referred to as severe acute
respiratory syndrome (SARS), was first identified in Guangdong
Province, China, and later spread to several countries. A novel
coronavirus was isolated and found to be the cause of SARS. The
SARS-coronavirus is a new member of the order Nidovirales, family
Coronaviridae, and genus Coronavirus. Several groups have carried
out mutation and phylogenetic analyses of the virus. 1-6

Phylogenetic analysis using biological sequences can be divided
into two groups. The algorithms in the first group calculate a matrix
representing the distance between each pair of sequences and then
transform this matrix into a tree. In the second type of approach,
instead of building a tree directly, the tree that best explains the
observed sequences under the evolutionary assumption is found by
evaluating the fitness of different topologies. For example, Jukes and
Cantor, 7 Kimura, 8 Barry and Hartigan, 9 Kishino and Hasegawa, 10
and Lake 11 proposed various distance measures. Camin and Sokal, 12
Eck and Dayhoff, 13 Cavalli-Sforza and Edwards, 14 and Fitch 15
gave parsimony methods. Felsenstein et al. 16-18 proposed maximum
likelihood methods.

However, all of these methods require a multiple alignment of the
sequences and assume some sort of evolutionary model. In addition
to the problems of multiple alignment (computational complexity
and inherent ambiguity of the alignment cost criteria), these methods
become insufficient for phylogenies using complete genomes.
Multiple alignment becomes misleading because of gene rearrangement,
inversion, transposition, and translocation at the substring level,
unequal sequence lengths, etc., and statistical evolutionary models
are yet to be suggested for complete genomes. On the other hand,
whole genome-based phylogenetic analyses are appearing because
single gene sequences generally do not carry enough information
to construct an evolutionary history of organisms. Factors
such as different rates of evolution and horizontal gene transfer
make phylogenetic analysis of species using single gene sequences
difficult.

Mathematical analysis of the large volume of genomic DNA
sequence data is one of the challenges for bioscientists. Graphical
representation of DNA sequences provides a simple way of viewing,
sorting, and comparing various gene structures. In recent years
several authors have outlined different graphical representations of
DNA sequences based on 2D, 3D, or 4D spaces. 19-32 Graphical techniques
have emerged as a very powerful tool for the visualization and analysis
of long DNA sequences. These techniques provide useful insights
into local and global characteristics and the occurrences, variations,
and repetition of the nucleotides along a sequence that are not as
easily obtainable by other methods. 29-33 Based on these graphical
representations, several authors have outlined approaches for
comparing DNA sequences. 34-38 Recently, we presented a new
two-dimensional graphical representation of DNA sequences, which
has no circuit or degeneracy. 19

Here, a new approach based on the 2D graphical representation
of the whole genome sequence is proposed to analyze the phylogenetic
relationships of genomes. The evolutionary distances are
obtained through measuring the differences among the 2D curves.
The examination of the phylogenetic relationships of coronaviruses
illustrates the utility of our approach.

2D Graphical Representation of DNA Sequences 

As shown in Figure 1, in a manner similar to the method of Yan et al., 34 we
construct a pyrimidine-purine graph on two quadrants of the Cartesian
coordinate system, with pyrimidines (T and C) in the first quadrant
and purines (A and G) in the fourth quadrant. The unit vectors
representing the four nucleotides A, G, C, and T are as follows:

$$(m, -\sqrt{n}) \to A, \quad (\sqrt{n}, -m) \to G, \quad (\sqrt{n}, m) \to C, \quad (m, \sqrt{n}) \to T$$



Figure 2. The curves corresponding to the chimpanzee sequence for
different parameters n and m.


where m is a real number and n is a positive real number that is not a
perfect square. Using this representation, we reduce
a DNA sequence to a series of nodes $P_0, P_1, P_2, \ldots, P_N$, whose
coordinates $x_i, y_i$ ($i = 0, 1, 2, \ldots, N$, where N is the length of the
DNA sequence being studied) satisfy

$$\begin{cases} x_i = a_i m + g_i \sqrt{n} + c_i \sqrt{n} + t_i m \\ y_i = -a_i \sqrt{n} - g_i m + c_i m + t_i \sqrt{n} \end{cases}$$

where $a_i$, $c_i$, $g_i$, and $t_i$ are the cumulative occurrence numbers of A,
C, G, and T, respectively, in the subsequence from the first base to
the i-th base in the sequence. We define $a_0 = c_0 = g_0 = t_0 = 0$.
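
The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `characteristic_curve` is hypothetical, the default parameters are the values n = 1/2 and m = 3/4 used later in the text, and ambiguity codes such as N are not handled.

```python
import math

def characteristic_curve(seq, m=0.75, n=0.5):
    """Return the nodes P_0..P_N of the curve as a list of (x, y) tuples."""
    r = math.sqrt(n)
    # One step vector per nucleotide, as defined in the text.
    steps = {
        "A": (m, -r),   # purine, fourth quadrant
        "G": (r, -m),   # purine, fourth quadrant
        "C": (r, m),    # pyrimidine, first quadrant
        "T": (m, r),    # pyrimidine, first quadrant
    }
    x, y = 0.0, 0.0
    nodes = [(x, y)]            # P_0 is the origin (a_0 = c_0 = g_0 = t_0 = 0)
    for base in seq.upper():
        dx, dy = steps[base]    # raises KeyError for symbols other than A, C, G, T
        x += dx
        y += dy
        nodes.append((x, y))
    return nodes

nodes = characteristic_curve("ATGC")
```

Accumulating the step vectors is equivalent to evaluating the two coordinate formulas with the cumulative counts of each base, so the i-th entry of `nodes` corresponds to the node P_i.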

We call the corresponding set of points the characteristic plot set.
The curve connecting all points of the characteristic plot set, in turn,
is called the characteristic curve, which is determined by m and n
satisfying the above-mentioned condition. In Figure 2, we show the
curves corresponding to the chimpanzee sequence for different parameters
n and m. Figure 2 shows that these curves are similar despite the
different values of n and m: they have the same tendency despite
their different lengths. In Figure 3, we present
the 2D curves for 24 complete coronavirus genomes (see Table 1)
with parameters n = 1/2 and m = 3/4, chosen initially by Yan
et al. 34

Observing Figure 3, we find that the curves of BCoV, BCoVL,
BCoVM, and BCoVQ show similar tendencies, as do the curves of
MHV2, MHV, MHVM, and MHVP. The curves of BJ01, CUHK-Su10,
CUHK-W1, SIN2679, SIN2748, SIN2774, HKU-39849, SIN2500, SIN2677,
TW1, Urbani, and TOR2 also show similar tendencies.


Phylogenetic Tree of Coronaviruses 

For any sequence, we have a set of points $(x_i, y_i)$, $i = 1, 2, 3, \ldots, N$,
where N is the length of the sequence. The coordinates of the
geometrical center of the points, denoted by $x^0$ and $y^0$, may be calculated
as follows 29

$$x^0 = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad y^0 = \frac{1}{N} \sum_{i=1}^{N} y_i. \tag{1}$$


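The elements of the covariance matrix CM of the points are
defined as

$$\begin{cases} CM_{xx} = \frac{1}{N} \sum_{i=1}^{N} (x_i - x^0)(x_i - x^0) \\ CM_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - x^0)(y_i - y^0) = CM_{yx} \\ CM_{yy} = \frac{1}{N} \sum_{i=1}^{N} (y_i - y^0)(y_i - y^0) \end{cases} \tag{2}$$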
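
As a small worked illustration of the geometric center and the covariance elements, the following sketch computes both for a list of points. It assumes the 1/N normalisation (the population form of the covariance), and the function name `center_and_covariance` and the three sample points are hypothetical.

```python
def center_and_covariance(points):
    """Geometric center (x0, y0) and covariance elements of a 2D point set.

    Assumes the 1/N normalisation; `points` is a non-empty list of (x, y).
    Returns ((x0, y0), (cm_xx, cm_xy, cm_yy)); cm_yx equals cm_xy.
    """
    n = len(points)
    x0 = sum(x for x, _ in points) / n
    y0 = sum(y for _, y in points) / n
    cm_xx = sum((x - x0) ** 2 for x, _ in points) / n
    cm_xy = sum((x - x0) * (y - y0) for x, y in points) / n
    cm_yy = sum((y - y0) ** 2 for _, y in points) / n
    return (x0, y0), (cm_xx, cm_xy, cm_yy)

# Toy input: three hypothetical points, not taken from any genome.
center, cm = center_and_covariance([(1.0, 2.0), (3.0, 2.0), (5.0, 8.0)])
```

For a genome, the input would be the points of its characteristic curve for i = 1, ..., N.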

The above four numbers give a quantitative description of a set of
points $(x_i, y_i)$, $i = 1, 2, \ldots, N$, scattered in a 2D space. Obviously,
Figure 3. The two-dimensional curves for 24 complete coronavirus genomes.
(A) IBV, BCoV, BCoVL, BCoVM, BCoVQ, HCoV-229E. (B) MHV2, MHV, MHVM,
MHVP, PEDV, TGEV. (C) BJ01, CUHK-Su10, CUHK-W1, SIN2679, SIN2748,
SIN2774. (D) HKU-39849, SIN2500, SIN2677, TW1, Urbani, TOR2. [Color
figure can be viewed in the online issue, which is available at
www.interscience.wiley.com.]

Table 1. The Accession Number, Abbreviation, Name, and Length for the 24 Coronavirus Genomes.

No.  Accession  Abbreviation  Genome                                   Length (nt)
1    NC_002645  HCoV_229E     Human coronavirus 229E                   27,317
2    NC_002306  TGEV          Transmissible gastroenteritis virus      28,586
3    NC_003436  PEDV          Porcine epidemic diarrhea virus          28,033
4    U00735     BCoVM         Bovine coronavirus strain Mebus          31,032
5    AF391542   BCoVL         Bovine coronavirus isolate BCoV-LUN      31,028
6    AF220295   BCoVQ         Bovine coronavirus Quebec                31,100
7    NC_003045  BCoV          Bovine coronavirus                       31,028
8    AF208067   MHVM          Murine hepatitis virus strain ML-10      31,233
9    AF101929   MHV2          Murine hepatitis virus strain 2          31,276
10   AF208066   MHVP          Murine hepatitis virus strain Penn 97-1  31,112
11   NC_001846  MHV           Murine hepatitis virus                   31,357
12   NC_001451  IBV           Avian infectious bronchitis virus        27,608
13   AY278488   BJ01          SARS coronavirus BJ01                    29,725
14   AY278741   Urbani        SARS coronavirus Urbani                  29,727
15   AY278491   HKU-39849     SARS coronavirus HKU-39849               29,742
16   AY278554   CUHK-W1       SARS coronavirus CUHK-W1                 29,736
17   AY282752   CUHK-Su10     SARS coronavirus CUHK-Su10               29,736
18   AY283794   SIN2500       SARS coronavirus Sin2500                 29,711
19   AY283795   SIN2677       SARS coronavirus Sin2677                 29,705
20   AY283796   SIN2679       SARS coronavirus Sin2679                 29,711
21   AY283797   SIN2748       SARS coronavirus Sin2748                 29,706
22   AY283798   SIN2774       SARS coronavirus Sin2774                 29,711
23   AY291451   TW1           SARS coronavirus TW1                     29,729
24   NC_004718  TOR2          SARS coronavirus                         29,751



Table 2. The Geometric Center and Two Eigenvectors for each of the 24 Coronavirus Genomes.

i   x^0          y^0       EV_λ1              EV_λ2
1   8.7251e+003  567.4895  (0.0671,-0.9977)   (-0.9977,-0.0671)
2   9.1181e+003  231.8617  (0.0265,-0.9996)   (-0.9996,-0.0265)
3   9.1658e+003  854.0672  (0.0891,-0.9960)   (-0.9960,-0.0891)
4   9.8471e+003  678.7491  (0.0682,-0.9977)   (-0.9977,-0.0682)
5   9.8494e+003  669.8507  (0.0683,-0.9977)   (-0.9977,-0.0683)
6   9.8708e+003  671.8188  (0.0678,-0.9977)   (-0.9977,-0.0678)
7   9.8504e+003  667.9839  (0.0684,-0.9977)   (-0.9977,-0.0684)
8   1.0225e+004  508.6553  (0.0456,-0.9990)   (-0.9990,-0.0456)
9   1.0217e+004  560.8241  (0.0484,-0.9988)   (-0.9988,-0.0484)
10  1.0166e+004  571.4215  (0.0492,-0.9988)   (-0.9988,-0.0492)
11  1.0266e+004  503.3193  (0.0457,-0.9990)   (-0.9990,-0.0457)
12  8.8359e+003  177.6139  (0.0271,-0.9996)   (-0.9996,-0.0271)
13  9.6653e+003  217.7081  (0.0348,-0.9994)   (-0.9994,-0.0348)
14  9.6644e+003  220.2759  (0.0347,-0.9994)   (-0.9994,-0.0347)
15  9.6693e+003  219.4720  (0.0345,-0.9994)   (-0.9994,-0.0345)
16  9.6690e+003  217.1652  (0.0346,-0.9994)   (-0.9994,-0.0346)
17  9.6687e+003  217.0494  (0.0346,-0.9994)   (-0.9994,-0.0346)
18  9.6602e+003  216.5541  (0.0347,-0.9994)   (-0.9994,-0.0347)
19  9.6587e+003  216.9280  (0.0347,-0.9994)   (-0.9994,-0.0347)
20  9.6601e+003  216.0181  (0.0346,-0.9994)   (-0.9994,-0.0346)
21  9.6583e+003  216.5654  (0.0347,-0.9994)   (-0.9994,-0.0347)
22  9.6601e+003  216.0584  (0.0346,-0.9994)   (-0.9994,-0.0346)
23  9.6656e+003  220.1538  (0.0347,-0.9994)   (-0.9994,-0.0347)
24  9.6724e+003  219.6501  (0.0346,-0.9994)   (-0.9994,-0.0346)

the matrix is a real symmetric 2 × 2 one. The eigenvectors and their
associated eigenvalues are defined as follows:

$$CM \cdot EV_k = \lambda_k \cdot EV_k, \quad EV_k = (EV_{k,1}, EV_{k,2})^T, \quad k = 1, 2.$$

Corresponding to each eigenvalue $\lambda_k$, there is an eigenvector
$EV_k$. Corresponding to $\lambda_1 < \lambda_2$, the two eigenvectors are denoted
by $EV_{\lambda_1}$ and $EV_{\lambda_2}$, respectively. In Table 2, we list the $(x^0, y^0)$ and
eigenvectors belonging to the 24 species with parameters m = 3/4, n = 1/2.

To facilitate the quantitative comparison of different species
in terms of their collective parameters, we introduce a distance
scale and an angle scale as defined below. Suppose that there are
two species i and j, with parameters $x_i^0, y_i^0, \lambda_1^i, \lambda_2^i$ and
$x_j^0, y_j^0, \lambda_1^j, \lambda_2^j$, respectively, where $(x_i^0, y_i^0)$ is the
geometrical center of the curve belonging to species i, and
$\lambda_1^i, \lambda_2^i$ are the two eigenvalues of the matrix
$CM_i$ corresponding to species i. The distance $d_{ij}$ between the two
points is 39

$$d_{ij} = \sqrt{(x_i^0 - x_j^0)^2 + (y_i^0 - y_j^0)^2}, \quad i, j = 1, 2, \ldots, M, \tag{3}$$

where $d_{ij}$ denotes the distance between the geometric centers of the
ith and the jth genomes, and M is the total number of all genomes
(M = 24, here). Then we obtain a real M × M symmetric matrix
whose elements are $d_{ij}$.

To reflect the differences between the trends of every two 2D
curves, the angles between the corresponding eigenvectors of every
two genomes are used. The 2D vectors are denoted as follows:

$$EV_k^i = (EV_{k,1}^i, EV_{k,2}^i)^T, \quad i = 1, 2, \ldots, M, \quad k = \lambda_1, \lambda_2. \tag{4}$$

The angle between the two vectors is denoted as follows:

$$\theta_{ij}^k = \arccos\left(\frac{EV_k^i \cdot EV_k^j}{|EV_k^i| \cdot |EV_k^j|}\right), \quad i, j = 1, 2, \ldots, M, \quad k = \lambda_1, \lambda_2. \tag{5}$$

The sum of $\theta_{ij}^k$ over k for given i, j can be used to reflect the trend
information of the eigenvectors involved:

$$\theta_{ij} = \theta_{ij}^{\lambda_1} + \theta_{ij}^{\lambda_2}, \quad i, j = 1, 2, \ldots, M. \tag{6}$$

Consequently, two sets of parameters are obtained. The first
reflects the difference of center positions, represented by the
Euclidean distance between the geometric centers. The second
indicates the difference of the trends of the 2D curves, represented by the
related eigenvectors. The overall distance $D_{ij}$ between the species
i and j is defined by

$$D_{ij} = d_{ij} \times \theta_{ij}, \quad i, j = 1, 2, \ldots, M. \tag{7}$$

Accordingly, a real symmetric M × M matrix $D_{ij}$ is obtained
and used to reflect the evolutionary distance between the species
i and j. The clustering tree is constructed using the UPGMA method
in the PHYLIP package (http://evolution.genetics.washington.edu/phylip.html).
The final phylogenetic tree is drawn using the
DRAWGRAM program in the PHYLIP package. In Figure 4, we
present the phylogenetic tree belonging to the 24 species.
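
The distance construction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, the eigenvector helper uses the standard closed form for a symmetric 2 × 2 matrix, and the example input is rows 1, 2, and 12 of Table 2 with the x^0 values written out in decimal form. Eigenvectors are defined only up to sign, so a consistent sign convention across genomes is needed before the angle formula is applied; the values in Table 2 already follow one.

```python
import math

def eigenvectors_sym2(a, b, c):
    """Unit eigenvectors of [[a, b], [b, c]], smaller eigenvalue first.

    The sign of each vector is fixed through b, which is a convention
    chosen for this sketch only.
    """
    if b == 0.0:  # already diagonal: the eigenvectors are the axes
        return [(1.0, 0.0), (0.0, 1.0)] if a <= c else [(0.0, 1.0), (1.0, 0.0)]
    mean, spread = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    result = []
    for lam in (mean - spread, mean + spread):
        norm = math.hypot(b, lam - a)          # (b, lam - a) solves (CM - lam I) v = 0
        result.append((b / norm, (lam - a) / norm))
    return result

def angle(u, v):
    """Angle between two 2D vectors (arccos of the normalised dot product)."""
    cos = (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp rounding error

def overall_distance(g1, g2):
    """D_ij = d_ij * theta_ij for genomes given as (center, EV_l1, EV_l2)."""
    (c1, u1, u2), (c2, v1, v2) = g1, g2
    d = math.dist(c1, c2)                       # distance between the centers
    theta = angle(u1, v1) + angle(u2, v2)       # sum over both eigenvectors
    return d * theta

# Rows 1, 2 and 12 of Table 2 (HCoV_229E, TGEV, IBV).
table2 = {
    "HCoV_229E": ((8725.1, 567.4895), (0.0671, -0.9977), (-0.9977, -0.0671)),
    "TGEV": ((9118.1, 231.8617), (0.0265, -0.9996), (-0.9996, -0.0265)),
    "IBV": ((8835.9, 177.6139), (0.0271, -0.9996), (-0.9996, -0.0271)),
}
names = list(table2)
D = [[overall_distance(table2[p], table2[q]) for q in names] for p in names]
```

The resulting `D` is a symmetric matrix with a zero diagonal, which is the form a distance-based tree-building program expects as input.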



Figure 4. Phylogenetic tree. 
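
The tree in the paper is built with the PHYLIP package. Purely to illustrate the UPGMA idea, the sketch below is a small self-contained version that repeatedly merges the two closest clusters and averages distances weighted by cluster size. It is not a substitute for PHYLIP: the function name `upgma`, the nested-tuple output, and the toy distance matrix are hypothetical, and branch lengths are not computed.

```python
def upgma(names, dist):
    """Cluster labels with UPGMA and return the topology as nested tuples.

    `dist` is a symmetric matrix (list of lists) aligned with `names`,
    which must contain at least one label.
    """
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)                       # closest pair of clusters
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        new_d = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            # Size-weighted mean equals the average over all leaf pairs.
            new_d[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        # Drop every distance that involves one of the merged clusters.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

# Toy distances between three hypothetical taxa.
tree = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
```

In this toy input A and B are the closest pair, so they are joined first and C attaches to that pair last.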
Conclusion

Most existing approaches for phylogenetic inference use multiple
alignment of sequences and assume some sort of evolutionary
model. The multiple alignment strategy does not work for all types
of data, for example, whole genome phylogeny, and the evolutionary
models may not always be correct. Our representation provides
a direct plotting method to denote DNA sequences without degeneracy.
From the DNA graph, the A, T, G, and C usage as well as the
original DNA sequence can be recaptured mathematically without
loss of textual information. The current 2D graphical representation
of DNA sequences provides different approaches for constructing
the phylogenetic tree. Unlike most existing phylogeny construction
methods, the proposed method does not require multiple alignment.
Also, both computational scientists and molecular biologists can use
it to analyze DNA sequences efficiently with different parameters
n and m.
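
The recoverability claim can be checked with a short round-trip sketch. Because the four step vectors are distinct, each pair of consecutive nodes on the curve identifies the base that produced it. The code below assumes the parameters n = 1/2 and m = 3/4; the helper names `encode` and `decode` are hypothetical, and nearest-vector matching is used only to absorb floating-point rounding.

```python
import math

M, N = 0.75, 0.5                      # parameters m and n used in the text
R = math.sqrt(N)
STEPS = {"A": (M, -R), "G": (R, -M), "C": (R, M), "T": (M, R)}

def encode(seq):
    """Nodes P_0..P_N of the curve, as running sums of the step vectors."""
    x = y = 0.0
    nodes = [(x, y)]
    for base in seq:
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        nodes.append((x, y))
    return nodes

def decode(nodes):
    """Recover the sequence by matching each step to the nearest base vector."""
    bases = []
    for (x0, y0), (x1, y1) in zip(nodes, nodes[1:]):
        step = (x1 - x0, y1 - y0)
        bases.append(min(STEPS, key=lambda b: math.dist(step, STEPS[b])))
    return "".join(bases)

seq = "ATGGCATTCGA"
recovered = decode(encode(seq))
```

Once the sequence is recovered, the base usage follows by counting the letters of `recovered`.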